This study developed, applied, and evaluated a theory-based
method of detecting the underlying causes of differential
difficulty. The method was intended to improve on traditional
approaches that too often produce uninterpretable results. 

Key elements of the method were the analysis of item clusters 
and the incorporation of theoretical predictions about cluster 
performance. The method was applied in two subgroups taking 
SAT-M and involved (1) reviewing literature syntheses to 
identify factors that might cause differential item 
functioning, (2) forming item categories based on those 
factors, (3) identifying categories that functioned 
differentially, (4) assessing the functioning of the items 
composing deviant categories, and (5) relating item and 
category functioning. Results were compared to a traditional
item-level analysis. In both subgroups, the cluster and
traditional methods agreed on the overall extent of 
differential functioning (substantial in the first group, 
virtually absent in the second). Additionally, the pattern of 
differential functioning detected was interpretable. At the
same time, several important limitations were apparent. The 
method would seem to be applied most productively when a small 
number of hypotheses can be derived from a reasonably strong
research base, overlap among cluster structures can be 
avoided, and results can be supplemented with experimental 
studies or protocol analyses.

Clusters as the Unit of Analysis
in Differential Item Functioning

Most traditional approaches to differential item 
functioning are built on the evaluation of relatively
unreliable, individual items (individual items are unreliable
in that they are generally poor indicators of the content
universe or construct they are intended to measure). When
differential functioning is being studied for several groups
simultaneously (e.g., Black examinees, women, Hispanic
candidates, handicapped people), or when subgroup sample sizes
are small, this limitation can have particularly severe 
consequences. Because individual items are relatively 
unreliable, numerous test questions will show statistically 
significant evidence of differential functioning by chance 
alone. Interpreting on a post hoc basis the resulting mix of 
false and true-positive items has proved very difficult,
resulting in little success locating the factors underlying 
differential functioning across test questions. 

To increase the chances of identifying underlying 
factors, investigators have implemented both experimental, 
theory-based methods and approaches based on the analysis of 
item clusters (e.g., Scheuneman, 1985; Schmeiser, 1981; Wild, 
1987a). Building upon these conceptual advances, we propose
an a priori, theory-based method built upon item clusters (see
Bennett, Rock, & Kaplan, 1987, for an initial version of the
methodology). Clusters are suggested as the unit of analysis
because they are more reliable than individual items (more
indicators of the universe are presented). A theory-based
approach is advocated because it forces hypotheses about 
differential functioning to be stated in advance of the data 
analysis. In combination, the use of clusters and theory 
should reduce the frequency of false positives, make more 
systematic the search for underlying causes, and provide 
better information for program policy decisions concerning the 
modification or possible elimination of broad item classes 
found to operate differentially for one or another group. The 
purpose of this study was to develop, try out, and evaluate 
such a theory-based method using items from the Scholastic 
Aptitude Test (SAT).

Subjects 

Subjects were members of two groups for whom differential 
difficulty currently is a concern. For the first group, 
visually impaired students taking the braille edition of the 
SAT, instances of differential difficulty have been found on 
the test's Mathematical section (Bennett, Rock, & Kaplan,
1987). These instances did not appear to be associated with
items possessing any single, distinctive characteristic.
Hence, a closer look at the performance of this group seems
warranted.

Differential difficulty for visually impaired candidates 
is particularly hard to evaluate via traditional methods 
because so few of these examinees take the braille edition and 
because content, format, and administration (i.e., timing) 
effects may be confounded. On the first count, the group
poses an appropriate methodological challenge because of the
increased stability associated with analyzing clusters instead
of items, an increase that should be particularly valuable
with small samples. The potential confounding, however, poses
a serious problem (as it would for any traditional approach),
and it may be that the causes of any significant observed
effects can only be fully understood through experimental
manipulations.

The second group is composed of Black students. Numerous
studies have focused upon the functioning of SAT Verbal items,
with several investigations finding evidence of differential
difficulty (e.g., Dorans, 1982; Kulick, 1984; Rogers & Kulick, 
1986; Schmitt, Bleistein, & Scheuneman, 1987). Far less 
attention has been paid, however, to the performance of this 
group on the Mathematical section. That differential 
difficulty might occur on the Mathematical section is
suggested by the results of research on other mathematics 
tests taken by high school and college-age populations 
(Scheuneman, 1978, 1985; Shepard, Camilli, & Williams, 1984). 

Visually impaired subjects were drawn from a pool of 
students taking special, extended-time administrations of SAT 
forms WSA3, WSA5, and CSA5. The WSA forms were administered
from March 1980 through June 1983 and the CSA form from
October 1983 through September 1986. (A second CSA form,
CSA7, was taken by too few visually impaired students to make
analysis worthwhile.) Visually impaired students were
eliminated from the pool if they requested a special test
edition other than braille (e.g., cassette) or requested the
braille edition in conjunction with another edition (e.g.,
large type). (Those who requested the braille edition along
with a regular print copy for use by a reader were retained.) 
Further, students were eliminated if they indicated on the 
Student Descriptive Questionnaire that English was not their 
best language. The resulting samples consisted of 91 students 
for WSA3, 96 for WSA5, and 74 for CSA5.

The performance of each of these handicapped samples was 
compared to a random sample of high school students who took 
the regular print versions of the same test forms under 
standard conditions. For WSA3, the reference group consisted
of 1,110 students randomly drawn from a two-state 
administration in October 1974; for WSA5, 1,398 examinees were 
selected randomly from the equating bank for a national 
administration in December of that same year (equating banks 
are large random samples used for placing forms on the SAT 
score scale). The CSA5 sample also was drawn randomly from
the equating bank for a national administration in October 
1980; 5,507 students composed this sample. All three samples 
were drawn to conform to the proportions of seniors in the 
handicapped groups: approximately .73 for WSA3, .05 for WSA5,
and .91 for CSA5. As for visually impaired examinees,
students indicating that English was not their best language 
were eliminated from the reference samples. 

Black subjects were randomly selected from those students 
taking standard administrations of three test forms: CSA5,
CSA7, and GSA2. The CSA5 national administration date is
given above, while CSA7 was administered nationally in 
December 1980 and GSA2 in January 1984. Random samples of 
approximately 5,000 examinees were drawn from the appropriate 
equating banks. These samples were then separated into Black 
and White groups. Finally, the White sample was adjusted (by 
deleting examinees) to produce the same proportions of juniors 
and seniors as the Black group. The resulting sample sizes were,
for Black examinees, 446, 834, and 705, and for White
examinees, 4,405, 4,798, and 3,985, respectively. Proportions
of seniors were approximately .90, .79, and .96 for CSA5,
CSA7, and GSA2, respectively.

Tables 1 and 2 present background information on the 
study and reference groups. For the visually impaired 
samples, perhaps the most obvious characteristic is their 
extremely small size — even though data were pooled across the 
three years that each form remained in service. Second, this 
group is consistently older than the reference sample, 
suggesting that these students take longer to progress through 
school. Finally, their SAT-M scores are consistently and 
substantially lower than those of their nonhandicapped peers (though by
different amounts) while their SAT-V scores are not. 



Insert Tables 1 and 2 about here 



For the Black examinee samples, several characteristics
stand out. First, males are underrepresented relative to the
reference group. Second, as in many previous analyses,
Mathematical and Verbal scores, and self-reported math grades,
are noticeably lower than the reference group's values (for 
scores, the differences approximate 1 standard deviation while 
for grades, they range from .20 to .37 standard deviations). 
Finally, the number of years of math taken is somewhat greater 
for the White group on two of the three test forms. This last 
result is particularly noteworthy as many studies have 
documented lower Black enrollments especially in precollege 
math courses (e.g., Johnson, 1984; Jones, 1984; Jones, Burton, 
& Davenport, 1984; Matthews, Carpenter, Lindquist, & Silver, 
1984). For students taking the SAT, a self-report of specific
course type was not requested until the advent of the GSA
forms. Review of these data confirms that Black students
administered GSA2 take fewer years of advanced math courses 
(i.e., trigonometry, precalculus, calculus) and more years of 
"other" math courses than their White peers. 

Method 

This study involved two major steps. First, each set of 
test forms was analyzed using the cluster-based method.
Second, the method was evaluated to determine its utility for
studying differential functioning. 

The Cluster-Based Method 

The cluster-based method involved (1) reviewing 
literature syntheses to identify factors that might cause 
differential item functioning, (2) forming item categories 
based on these factors, (3) identifying categories that
functioned differentially, (4) assessing the functioning of
the items composing errant categories (when such categories
were found), and (5) relating item and category functioning
(again, conditional on discovering deviant categories). 

Identifying relevant subgroup factors. To identify
subgroup factors relevant to SAT Mathematical item
performance, several activities were undertaken. First,
searches were conducted of the ERIC and Psychological
Abstracts databases to identify for each study group syntheses 
of the literature on mathematics and cognitive processing. 
Second, existing differential item functioning studies were 
reviewed for indications of how subgroup characteristics might 
affect item performance. Finally, individuals knowledgeable 
about the characteristics of the subgroups were contacted for 
suggestions.

For both groups, these activities produced a limited 
amount of information. In particular, the literature searches 
uncovered surprisingly few relevant references. For visually 
impaired students, the available sources indicated three main 
factors that might produce differential functioning. The 
first factor related to the cognitive characteristics of 
visually impaired students. The major difference in cognitive 
characteristics between sighted students and those blind from 
birth relates, of course, to visual experience. Because of 
their more limited experience, blind students generally have 
less well-developed spatial abilities than their sighted 
peers. Differences in ability may manifest themselves on such
tasks as synthesizing shapes from their component parts
(Warren, 1981), perceiving embedded figures (Witkin, Birnbaum,
Lomonaco, Lehr, & Herman, 1968), using or understanding
spatial language (Warren, 1981), or working with complex
figures (e.g., figures in three dimensions; figures with
shaded portions in which the dots used for shading may be
misinterpreted as braille notation).

A second factor is that the availability of sight makes 
some operations easier. For example, sighted examinees 
frequently can gather visually some of the information needed
to arrive at a solution; for instance, the sizes of "special"
angles (i.e., 30°, 60°) can be estimated and compared.
Further, sighted students can sometimes eliminate incorrect
options through visual inspection or even tentatively identify
the correct one. Finally, they can construct diagrams as an
aid in solving certain types of items (e.g., Venn diagrams for
logic problems).

The third factor was related to braille. Reading in 
braille has associated with it several pertinent effects. 
First, text takes more space to represent. One result is that
visually impaired students who use this medium are generally
slower readers than nonhandicapped students. Roman numeral
item formats (i.e., where the answer options refer to
statements identified by roman numerals), lengthy word
problems, and tables (which usually extend across two braille
pages) may be more difficult to process because information
takes longer to encode and must, therefore, be kept in
short-term memory longer. Another result is that units that can
easily fit on a printed page must sometimes be broken up or 
reformatted. So, for example, questions must often be 
presented on a different page from the stimulus to which they 
refer and axis labels in graphs often are replaced with 
letters defined in a legend below the graph. Yet a third 
result is that figure labels (e.g., angle measurements) cannot 
be unambiguously placed unless the figure is substantially 
larger than the printed model. 

A second effect of reading in braille is that some 
symbols that have no special meaning in print have meaning in 
braille and, thus, can cause confusion. In addition, some 
meaningful symbols that are not easily confused in print are 
so in braille. For example, the letters "A-J" and the numbers 
"1-9" use — with the exception of a prefix — the same braille 
notation. 

For Black students, past research also offers hints about 
factors that might cause differential item difficulty. For 
example, Shepard, Camilli, and Williams (1984), using data 
from 10th and 12th graders participating in the High School
and Beyond study, found verbally-loaded math items to show
frequent indications of bias. During development of the Otis-
Lennon Mental Ability Test, Scheuneman (1978) found math word
problems to be more frequently biased on the 11th-12th grade
form of the test than items involving straightforward number 
manipulation. Finally, Scheuneman (1985) discovered GRE 
Quantitative items requiring the problem to be abstracted from
a verbal description to show similar indications of
differential difficulty. In addition to more abstract,
verbally-loaded math problems, Scheuneman's (1985) study
discovered indications of differential functioning associated
with geometry items, key placement (Black students appeared to
use a guessing strategy based on selecting the middle option),
and quantitative comparison items that could be solved with a
diagram but which did not include one.

Forming item categories. On the basis of the previous
analysis, 13 overlapping Mathematical cluster structures were
formed (structures were overlapping in that items belonged to
more than one structure simultaneously). Within each
structure, items were organized into mutually-exclusive 
content categories. These categories included ones 
hypothesized to be differentially difficult and "baseline" 
categories not hypothesized to show differential functioning 
(several categories, such as Diagram size; medium, were
borderline ones meant to separate categories that clearly
should show differential functioning from those that clearly
should not). Table 3 presents the item categories composing
each cluster structure and identifies those hypothesized to be 
differentially difficult. Definitions for each category are 
given in the appendix. 



Insert Table 3 about here



Several points about Table 3 should be noted. First, a
cluster structure was tested for a population only if an 
expectation of differential functioning existed for that 
group. Second, for some cluster structures, only a small 
number of item categories were hypothesized to function 
differentially. The remaining categories in the structure 
were tested to provide the baseline information referred to 
above. Finally, no research could be located to generate 
hypotheses about the specific types of geometry items that 
might function differentially for Black students.
Consequently, all categories were tested as an exploratory
endeavor.

For each of the categories listed in Table 3, one or more 
item clusters per form was constructed depending upon the 
number of available items meeting the category definition. In 
forming clusters, efforts were made to keep the number of 
items within the five-to-nine range. The five-item lower 
bound reduces the influence of guessing on item performance, 
making for more reliable behavioral indicators than could 
otherwise be obtained. In addition, with a minimum of six
score points, a reasonable interval scale can be achieved.
The maximum of nine items was set to keep individual clusters
from becoming too large with respect to the total number of 
items in the test. 

Because some theoretically-meaningful clusters had very 
few items (e.g., 3-dimensional solids), it was not always 
possible to maintain the five-item lower bound. Rather than
combine these instances into larger heterogeneous groupings,
these "mini-clusters" were retained to be analyzed as clusters 
when composed of two-to-four items and as individual items 
when composed of a single test question. 
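
As an illustration only, the size rules just described can be written as a short Python sketch. The function name and the "oversized" label are hypothetical; the text says only that one or more clusters per form were built depending on the number of available items, so the handling of ten or more items is an assumption.

```python
def classify_cluster(n_items: int) -> str:
    """Label a candidate item grouping by size, following the bounds in the text."""
    if n_items < 1:
        raise ValueError("a cluster needs at least one item")
    if n_items == 1:
        return "single item"   # analyzed with the item-level criterion
    if n_items <= 4:
        return "mini-cluster"  # two to four items, still analyzed as a cluster
    if n_items <= 9:
        return "cluster"       # the target five-to-nine range
    return "oversized"         # hypothetical label: presumably split into several clusters


for n in (1, 3, 5, 9, 12):
    print(n, classify_cluster(n))
```

A five-item cluster yields six possible rights-only score points (0 through 5), which is the minimum referred to above.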

Identifying differentially functioning clusters. To
determine if item clusters operated differently for the study 
groups, a covariance adjustment was used in which the linear 
regression of each item-type cluster score on the total 
Mathematical score for each reference group was computed. In 
performing these computations, rights-only raw scores were 
used and the cluster score was removed from the total score.
A comparison of the standardized difference between the study
and reference group means on rights-only, formula, and scale 
scores suggested that these scores were functionally 
equivalent for group matching purposes (see Table 4). 



Insert Table 4 about here 



Using the reference-group regressions, cluster scores for 
each study group were predicted from their members' total 
Mathematical scores (after removing the cluster score). The
predicted cluster mean for each group was then subtracted from 
that group's actual cluster mean, yielding a positive residual 
if the study group students did better than expected and a 
negative one when performance was lower than predicted. 
Finally, this residual was divided by the cluster standard 
deviation for the study group. A hypothesized cluster
category was said to be differentially difficult if its
standardized residuals were equal to or less than -.2 on the 
majority of instances of that category across SAT forms and 
the associated baseline category generally showed no 
consistent evidence of differential difficulty. This 
criterion is suggested by Cohen (1969) as a minimum for
identifying the presence of meaningful effects in the social 
sciences. It is recognized that this criterion is somewhat 
arbitrary and that there is considerable debate over what size 
effect should be considered meaningful. However, previous 
analyses using this criterion have shown it to be a reasonably 
liberal one in identifying item effects (Bennett, Rock, & 
Kaplan, 1987). 
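
The standardized residual described above can be sketched in Python as follows. This is a minimal illustration, not the authors' program: the function names and toy data are invented, the sample standard deviation (n - 1 denominator) is an assumption because the text does not say which form was used, and the baseline-category check is left out because the text treats it as a judgment ("generally showed no consistent evidence").

```python
from statistics import mean, stdev


def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def standardized_residual(ref_rest, ref_cluster, study_rest, study_cluster):
    """Actual minus predicted study-group cluster mean, in study-group SD units.

    *_rest is the total rights-only score with the cluster score removed.
    The regression is estimated in the reference group only.
    """
    slope, intercept = fit_line(ref_rest, ref_cluster)
    predicted_mean = mean(intercept + slope * x for x in study_rest)
    residual = mean(study_cluster) - predicted_mean
    return residual / stdev(study_cluster)  # assumption: sample SD (n - 1)


def majority_at_or_below(residuals, cutoff=-0.2):
    """True if more than half of a category's clusters meet the cutoff."""
    return sum(r <= cutoff for r in residuals) > len(residuals) / 2


# Toy data, invented for illustration only.
ref_rest, ref_cluster = [10, 20, 30, 40], [2, 4, 6, 8]
study_rest, study_cluster = [10, 20, 30], [1, 4, 5]
z = standardized_residual(ref_rest, ref_cluster, study_rest, study_cluster)
print(round(z, 3))
```

A negative value means the study group scored lower on the cluster than reference students with the same remaining total score would be expected to score.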

In a few cases, clusters were composed of only single 
items. In these instances, the differential difficulty 
criterion was set at the approximate equivalent of a 10 
percentage-point difference in the probability of passing the 
item (the statistic used to evaluate the functioning of 
individual items and the rationale for this criterion are 
described in the next section).

As noted, the procedure used is a form of covariance 
adjustment. In general, such adjustments require that
assumptions of linearity and parallelism of regression be met. 
In the present case, the use of item cluster scores as the 
dependent variable decreases the possibility of nonlinearity 
because such scores are continuous. Additionally, where 
nonlinearity exists, linear regression estimates should
provide reasonable approximations. Further protection against
nonlinearity in the reference sample might have been provided 
by matching reference and study group subjects on total score 
before estimating cluster performance. This matching, 
however, would have reduced sample size and, consequently, the 
statistical power of the analysis. As a result, it was not 
implemented.

With respect to the assumption of parallelism, the 
covariance adjustment used here follows the so-called "Belson 
model" (Belson, 1956; Snedecor & Cochran, 1980), in which the 
regression estimates from a larger comparison group are used 
to predict effects for a smaller treatment group. Under this 
model, parallel regressions are not assumed. Also, for the 
present study, the fact that the reference and study group 
regressions may not always be parallel is of little 
consequence because primary interest is in cluster performance 
differences at that point in the total score distribution 
where most study group individuals fall — that is, at the group 
mean. The focus, in other words, is only on whether there is 
a discrepancy between the performance of study group 
individuals and reference students operating at the same level 
as the majority of study group examinees. 

Determining differential item performance. The items
composing a cluster were analyzed individually if (1) the 
category had been hypothesized to show differential 
difficulty, and (2) a majority of the clusters in the category 
behaved deviantly, and (3) associated baseline categories
showed no consistent evidence of such functioning. Items were
analyzed individually to determine if the category itself 
defined a potentially biased item type or, alternatively, if 
only a few aberrant items accounted for the unusual cluster 
performance. 

For each item composing an errant category, Mantel- 
Haenszel (M-H) statistics were generated (Holland & Thayer, 
1986). The M-H procedure is a form of 2 x 2 x s contingency table
analysis with two groups (study and reference) each
categorized by success or failure on an item and matched on s
categories (the categories are typically s score levels of a
test). The M-H statistic compares the odds of success for the
two groups and can be expressed as:

$$\alpha_{MH} = \frac{\sum_s R_{bs} W_{fs} / N_s}{\sum_s R_{fs} W_{bs} / N_s}$$

where

R = the frequency of right responses,
W = the frequency of wrong responses, and
N = the frequency of responses in stratum s.

A frequently used transformation for $\alpha_{MH}$ is to the delta
scale (Holland & Thayer, 1986). This scale provides an
effect-size estimate of differential item performance. The
transformation is:

$$\Delta_{MH} = -2.35 \ln(\alpha_{MH})$$

For the M-H procedure, the matching criterion employed
was SAT Mathematical rights-only raw score. This use of total 
score as a matching variable is supported by research which 
suggests that SAT Mathematical scores are unidimensional and 
that they generally have the same meaning across handicapped 
and nonhandicapped groups (Dorans & Lawrence, 1987; Rock, 
Bennett, & Kaplan, 1987; Willingham, Ragosta, Bennett, Braun, 
Rock, & Powers, 1988).
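
A minimal Python sketch of the M-H odds ratio and its delta transformation follows. It assumes that the subscripts b and f in the formula denote the base (reference) and focal (study) groups, a reading the text does not state explicitly; the function names and the two-stratum toy table are invented for illustration.

```python
import math


def mantel_haenszel_alpha(strata):
    """Pooled odds ratio over matched score levels.

    strata: iterable of (R_b, W_b, R_f, W_f) counts per score level, where
    b is read as the base (reference) group and f as the focal (study) group.
    Raises ZeroDivisionError if the denominator sum is zero.
    """
    num = den = 0.0
    for r_b, w_b, r_f, w_f in strata:
        n = r_b + w_b + r_f + w_f
        if n == 0:
            continue  # an empty score level contributes nothing
        num += r_b * w_f / n
        den += r_f * w_b / n
    return num / den


def mh_delta(alpha):
    """Delta-scale transformation: negative values mean harder for the focal group."""
    return -2.35 * math.log(alpha)


# Two hypothetical score levels: (R_b, W_b, R_f, W_f).
toy = [(40, 10, 6, 4), (30, 20, 4, 6)]
alpha = mantel_haenszel_alpha(toy)
print(round(alpha, 4), round(mh_delta(alpha), 4))
```

An odds ratio of 1 maps to a delta of 0, that is, no differential functioning after matching.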

For both the Black/White and visually handicapped/ 
nonhandicapped comparisons, examinees were partitioned into 61 
levels based on SAT-Mathematical score. In addition, a 
correction for differential speededness given by Schmitt, 
Bleistein, and Scheuneman (1987) was used. This correction 
accounts for some subgroups differentially reaching items at 
the end of the test. The adjustment redefines the proportion 
correct at each score level from the total number of students 
getting the item correct divided by all students taking the 
test to the total correct divided by only those students who 
reached the item. 
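
The effect of the speededness adjustment on a single score level can be illustrated with a short Python sketch. The function name, arguments, and counts are hypothetical; only the change of denominator comes from the text.

```python
def proportion_correct(n_correct, n_tested, n_not_reached=0, adjust=True):
    """Proportion correct on an item at one score level.

    With adjust=True the denominator is limited to examinees who reached
    the item; with adjust=False it is everyone who took the test.
    """
    denominator = n_tested - n_not_reached if adjust else n_tested
    if denominator <= 0:
        raise ValueError("no examinees available at this score level")
    return n_correct / denominator


# Hypothetical level: 60 examinees, 10 never reached the item, 30 answered it correctly.
unadjusted = proportion_correct(30, 60, 10, adjust=False)
adjusted = proportion_correct(30, 60, 10, adjust=True)
print(unadjusted, adjusted)
```

Under the adjustment, examinees who did not reach the item are no longer counted as having failed it.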

Table 5 presents the mean number of items not reached and 
the mean omitted for the study and reference groups. Black 
students reached significantly fewer items than White 
examinees on all three forms, while visually impaired 
students, by virtue of receiving extra time, completed 
significantly more items than their sighted counterparts on 
two of the three forms. Visually impaired students also 
omitted more items than reference group students. However,
for visually impaired students, the SAT is more of a power
test. As such, it is more likely that visually impaired 
students omit, not because they are unsure of the answers and 
are rushing to complete the exam as might their nonhandicapped
counterparts, but because they have thoroughly considered the 
items and do not know the answers. Therefore, attempting to 
correct for these differential omit rates does not seem 
justified.



Insert Table 5 about here 



In the current study, an item was considered to show 
differential functioning if its $\Delta_{MH}$ was different from zero at
the .05 level of significance and was of a practically
important size. Because no conventional criterion for
practical importance in the context of differential item
operation exists, any specification must be a judgment. For
the delta scale transformations of the M-H statistic a
difference of one unit has been suggested as a meaningful
difference (Wild, 1987b). Except for very hard or very easy
items, such a difference is approximately equal to 10
percentage points in the probability of passing an item (Wild,
1987b).
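
The correspondence between one delta unit and roughly 10 percentage points can be checked with a small worked example in Python. The sketch assumes the odds ratio is read as reference-group odds over study-group odds at a single score level; the function name and the reference pass rates are chosen for illustration.

```python
import math


def focal_pass_rate(p_ref, delta):
    """Study-group pass rate implied by a reference pass rate and a delta value."""
    alpha = math.exp(-delta / 2.35)           # invert delta = -2.35 ln(alpha)
    odds_focal = (p_ref / (1 - p_ref)) / alpha
    return odds_focal / (1 + odds_focal)


# Difference in pass rates for delta = -1 at several reference pass rates.
for p_ref in (0.05, 0.50, 0.95):
    gap = p_ref - focal_pass_rate(p_ref, -1.0)
    print(p_ref, round(gap, 3))
```

For an item of middle difficulty the gap is close to 10 points, while for very easy or very hard items the same delta corresponds to a much smaller difference, consistent with the qualification above.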

Relating item and cluster performance. Large item
effects do not necessarily help to explain cluster 
performance. For instance, contrary effects (e.g., a 
differentially easy item in a differentially difficult
cluster), though relatively rare, not only fail to support the
cluster result but may actually dampen the effects of items 
consistent with cluster findings. 

For the hypothesized clusters found to be differentially 
difficult, each item result was compared to the effect shown 
by the cluster. An item effect was considered helpful in 
explaining cluster performance if it was concordant with the 
differential difficulty of the parent. To determine whether 
item-level results suggested differential functioning for the 
broad class represented by a hypothesized cluster, the number 
of differentially functioning items was tabulated. An item 
category was said to evidence pervasive differential 
functioning if at least half of the items in a majority of 
clusters showed concordant differential effects. 
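
The tabulation rule for pervasive differential functioning can be expressed as a short Python sketch. The function name and the example flags are hypothetical, and "majority" is read here as strictly more than half of the clusters.

```python
def is_pervasive(clusters):
    """Apply the rule: at least half of the items in a majority of clusters.

    clusters: list of lists of booleans, one inner list per cluster, where
    True marks an item with a concordant differential effect.
    """
    qualifying = sum(
        1 for items in clusters if items and 2 * sum(items) >= len(items)
    )
    return 2 * qualifying > len(clusters)


# Hypothetical category with three clusters: two of the three qualify.
example = [
    [True, True, False, False],  # 2 of 4 items: at least half
    [True, False, False],        # 1 of 3 items: fewer than half
    [True, True, True],          # 3 of 3 items
]
print(is_pervasive(example))
```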

Evaluating the Cluster-Based Method 

To evaluate the cluster-based method, several analyses 
were performed. First, the number of significant item 
performances detected in the supported hypothesized categories 
was tabulated and compared to the number of unique items 
causing those performances. This analysis was intended to 
assess the effect of overlapping cluster structures on the 
interpretability of results: a high ratio of significant
performances to unique items suggests that the same small core
of items may be causing several categories in different 
cluster structures to operate deviantly. 
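
As a hedged illustration of this first evaluation index, the ratio of significant item performances to unique items can be computed as below. The function name and the item numbers are invented; only the category names are taken from the text.

```python
def performance_to_item_ratio(flagged_by_category):
    """Significant item performances divided by the unique items behind them."""
    performances = sum(len(items) for items in flagged_by_category.values())
    unique_items = set()
    for items in flagged_by_category.values():
        unique_items |= set(items)
    if not unique_items:
        raise ValueError("no flagged items")
    return performances / len(unique_items)


# Hypothetical item numbers flagged within three overlapping categories.
flagged = {
    "Stimulus format; figures": {3, 7, 12},
    "Diagram size; small": {3, 7},
    "Spatial factor; estimation helpful": {7},
}
print(performance_to_item_ratio(flagged))
```

A ratio of 1 would mean that every significant performance comes from a different item; larger values indicate that the same items are being counted in several cluster structures.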

Second, the number of unique significant items detected 
in the supported hypothesized categories was compared with the
number of unique significant items that would have been
detected had all differentially functioning categories been 
analyzed at the item level. This analysis speaks to the 
effect of using theoretical predictions to guide the analysis. 
If the theoretical predictions are meaningful, deviant 
categories for which no theoretical predictions were made 
should contain few additional differentially functioning 
items.

Third, the number of unique significant items detected in 
the supported hypothesized categories was compared with the 
number of unique significant items that would have been 
detected by an analysis of all test items. The intention of 
this analysis was to estimate how comprehensive the cluster- 
based method was in detecting differential item functioning.
To do this, M-H indices were computed for all items for which
these statistics had not been previously generated; as before,
items were considered to function differentially if their
$|\Delta_{MH}|$ equalled or exceeded 1.0 and differed from zero at the .05
level of significance.

Fourth, the direction of differential functioning 
suggested by the cluster-based method was compared with that 
indicated by the item-level analysis. Based on the literature 
review, the cluster-based method predicted substantial 
differential difficulty for both study groups. If supported 
by the cluster-based empirical results, this finding should be 
capable of confirmation through item-level analysis: all
other things being equal, on the whole test, more items should
be differentially difficult than differentially easy. 

Finally, the meaning of the standardized residual 
statistic was assessed by computing its product-moment 
correlation with the cluster mean $\Delta_{MH}$ value (i.e., the mean $\Delta_{MH}$
taken over all items in a cluster) and by reviewing the
size of the mean $\Delta_{MH}$ values for the supported categories.

Results 

Visually Impaired Students 

Table 6 presents the standardized residuals for all 
tested cluster categories. Of those 14 cluster categories
hypothesized to show differential difficulty, six had a
majority of their residuals exceeding the -.2 criterion in
conjunction with baseline categories that did not show similar
differential functioning. These six categories were Geometry;
triangles, Spatial factor; possible spatial component, Spatial
factor; estimation helpful in eliminating options, Stimulus
format; figures, Stimulus format; graphs and tables, and
Diagram size; small (the borderline category, Diagram size;
medium, was also differentially difficult). The residuals for
the Geometry; triangles category must be considered only 
weakly supportive of the hypothesis, however; while strong 
indications of differential difficulty were clear for the WSA 
forms, the residual for the CSA form was just as strongly 
positive, contradicting any argument of consistent 
differential difficulty. Of the remaining five cluster 
categories, the most consistent evidence for differential 



Clusters as the Unit 



difficulty was found for Stimulus format; figures , for which 
six of six clusters exceeded the -.2 criterion, Diagram size; 
small . for which three of three clusters were differentially 
difficult, and Spatial factor; estimation helpful , for which 
two of two met the cut-off. 
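
The decision rule used in this paragraph can be sketched as follows. This is one reading of the rule, not the authors' code: a hypothesized category counts as supported when a majority of its cluster residuals are at or below -.2 and no baseline category in the same structure shows a similar majority. The function names are hypothetical; the residuals are copied from Table 6.

```python
CUTOFF = -0.2  # standardized residual criterion stated in the text

def majority_difficult(residuals, cutoff=CUTOFF):
    """True when more than half of the cluster residuals are at or below the cutoff."""
    hits = sum(1 for r in residuals if r <= cutoff)
    return hits * 2 > len(residuals)

def category_supported(hypothesized, baselines, cutoff=CUTOFF):
    """Hypothesized category is deviant while none of its baselines is (assumed reading)."""
    if not majority_difficult(hypothesized, cutoff):
        return False
    return not any(majority_difficult(b, cutoff) for b in baselines)

# Residuals from Table 6, pooled over WSA3, WSA5, and CSA5.
small = [-0.41, -0.26, -0.20]          # Diagram size: small
large = [0.10, -0.36, -0.07]           # Diagram size: large (baseline)
embedded = [-0.16, -0.58, -0.41]       # Embedded figures: embedded
not_embedded = [-0.64, -0.27, -0.27, -0.32]  # Embedded figures: not embedded (baseline)

print(category_supported(small, [large]))            # True
print(category_supported(embedded, [not_embedded]))  # False
```

The second call returns False because the baseline category is itself differentially difficult, which is the situation the text describes for the Embedded figures and Label A-J structures.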



Insert Table 6 about here 



Of the 14 categories hypothesized to be differentially 
difficult, eight did not show meaningful evidence of such 
functioning either because their cluster residuals were not 
consistently differentially difficult or because, while they 
were, so were the clusters belonging to their associated 
baseline categories. Among the former were Geometry: 3-D 
solids; Miscellaneous: newly defined operations; Spatial 
factor: no figure, but drawing helpful; Reading load: high; 
Graphic placement: separated from item; and Shading: shaded. 
Cluster structures in which both the hypothesized and baseline 
categories were differentially difficult included Embedded 
figures and Label A-J. 

Finally, there were two instances in which cluster 
structures had some baseline categories that were 
differentially difficult and others that were not. Both of these 
instances represented collections of items that did not fit 
under any well-defined grouping. These were Geometry: other 
geometry and Miscellaneous: other miscellaneous. 

Table 7 presents the proportion of differentially 
difficult items in each supported hypothesized category (the 
differentially functioning borderline category, Diagram size: 
medium, is also included). As the table indicates, in several 
instances at least half of the items were differentially 
difficult for a majority of clusters: Geometry: triangles, 
Spatial factor: estimation helpful in eliminating options, 
Stimulus format: figures, and Diagram size: small/medium. 
Again, the Geometry: triangles category met the cut-off 
because of the behavior of items on the WSA forms; items on 
the CSA form showed little indication of differential 
difficulty. 



Insert Table 7 about here 



An indication of the extent to which these results are 
affected by the presence of overlapping cluster structures can 
be gained from a comparison of the number of significant item 
performances to the number of unique significant items. Over 
the six supported hypothesized categories and one supported 
borderline category, 17 individual item performances were 
significant for WSA3, 22 for WSA5, and 19 for CSA5. These 
significant performances were due to a small core of items 
(eight on each of the WSA forms and seven on CSA5) that 
repeatedly appeared in different cluster structures. 
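
The overlap comparison can be reproduced from the counts reported in Table 7. The sketch below sums the numerators of the Table 7 proportions (significant item performances) for each form and sets them against the unique-item counts stated in the text; the variable names are illustrative.

```python
# (significant items, items in cluster) for every cluster listed in Table 7.
table7 = {
    "WSA3": [(3, 5), (3, 6), (3, 6), (5, 8), (0, 1), (3, 5)],
    "WSA5": [(2, 4), (1, 4), (3, 6), (2, 4), (2, 5), (4, 7), (2, 3), (2, 3), (4, 6)],
    "CSA5": [(1, 5), (2, 6), (4, 6), (3, 5), (3, 7), (1, 6), (2, 6), (3, 7)],
}
# Unique significant items per form, as stated in the text.
unique_items = {"WSA3": 8, "WSA5": 8, "CSA5": 7}

# Total significant item performances per form.
performances = {form: sum(sig for sig, _ in cells) for form, cells in table7.items()}
print(performances)  # {'WSA3': 17, 'WSA5': 22, 'CSA5': 19}

# Average number of cluster structures in which each unique item was counted.
for form, total in performances.items():
    print(form, round(total / unique_items[form], 2))
```

Each unique significant item was therefore counted between two and three times on average, which is the overlap the paragraph describes.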

One means of assessing the extent to which theoretical 
predictions were helpful in guiding the analysis is by 
comparing the number of unique significant items found in the 
hypothesized categories to the number found in all 
differentially functioning categories. When all significant 
categories were considered regardless of hypothesis, only one 
more significant item was detected on each of the WSA forms 
and none on CSA5. 

To determine how comprehensive the cluster-based method 
was in detecting differential item functioning, the proportion 
of unique significant items located by the method was computed 
as a function of all unique significant items on all forms of 
the test. The supported hypothesized categories accounted for 
57% (8 of 14), 53% (8 of 15), and 64% (7 of 11) of significant 
items on WSA3, WSA5, and CSA5, respectively. 
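
As a check on the arithmetic, the coverage percentages follow directly from the counts given in the paragraph. The dictionary name below is illustrative.

```python
# (unique significant items found by the cluster method, all significant items on the form)
coverage_counts = {"WSA3": (8, 14), "WSA5": (8, 15), "CSA5": (7, 11)}

coverage_pct = {
    form: round(100 * found / total)
    for form, (found, total) in coverage_counts.items()
}
print(coverage_pct)  # {'WSA3': 57, 'WSA5': 53, 'CSA5': 64}
```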

Fourth, the direction of differential functioning 
suggested by the cluster-based method was compared with the 
results of analyzing all items. The cluster-based method 
predicted numerous instances of differential difficulty and 
located substantial supporting evidence. When all items on 
all three forms were analyzed, the number of differentially 
difficult items was almost three times the number of 
differentially easy ones (see Table 8). 



Insert Table 8 about here 



Finally, to better understand the meaning of the 
standardized residual, (1) the mean M-H value for each cluster 
was correlated with the cluster standardized residual and (2) 
the level of cluster differential difficulty indicated by the 
standardized residual was compared for each cluster with the 
degree of difficulty suggested by the mean M-H value. For all 
analyzed clusters, the correlations were .85, .95, and .98 for 
WSA3, WSA5, and CSA5, respectively, suggesting substantial 
overlap in the phenomenon being measured. With respect to 
level of differential difficulty, Table 9 lists the mean M-H 
values for the supported hypothesized categories. As the 
table indicates, these values generally support the inferences 
made from the standardized residuals: most of the supported 
hypothesized categories had a majority of their clusters 
showing significant differential difficulty (i.e., mean values 
more negative than -1.00). In addition, as with the other indices, 
the Geometry: triangles category evidenced contradictory 
results (differential difficulty for the WSA forms but 
differential easiness for the CSA form). 
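
The correlation step can be illustrated with a small product-moment computation. The sketch below pairs the CSA5 standardized residuals (Table 6) with the CSA5 mean values (Table 9) for the seven supported clusters that have a legible entry in both tables. Because it uses only this subset rather than all analyzed clusters, it is not expected to reproduce the .98 reported above; the function name is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# CSA5 clusters: possible spatial component, estimation helpful, figures (two
# clusters), graphs and tables, diagram size small, diagram size medium.
residuals = [-0.42, -0.66, -0.26, -0.47, -0.25, -0.20, -0.52]     # Table 6
mean_values = [-1.13, -1.87, -0.89, -1.12, -0.52, -0.54, -1.29]   # Table 9

r = pearson_r(residuals, mean_values)
print(round(r, 2))
```

Even on this subset the two indices move closely together, in line with the conclusion that they measure largely the same phenomenon.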



Insert Table 9 about here 



Black Students 

Table 10 presents the standardized residuals for all 
cluster categories tested for Black students. None of the six 
hypothesized categories had a majority of its residuals 
exceeding the -.2 criterion. Three nonhypothesized 
categories, however, met the criterion. These were Spatial 
factor: possible spatial component, Reading difficulty: easy, 
and Concrete/abstract: concrete. 



Insert Table 10 about here 



Because none of the hypothesized categories proved 
significant, no individual items were detected via the 
cluster-based method. When all significant categories were 
considered regardless of hypothesis, two unique significant 
items each were detected on CSA5 and CSA7, and three on GSA2 
(these seven items accounted for all the differentially 
difficult items on all three forms). Finally, when M-H values 
were computed for all items on all forms, the number of 
differentially difficult items was half the number of 
differentially easy ones (see Table 11), supporting the 
general absence of differential difficulty indicated by the 
cluster-based method. 
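
The direction check for both groups reduces to a ratio of counts. The sketch below totals the differentially difficult and differentially easy items from Tables 8 and 11; the helper name is illustrative.

```python
# (differentially difficult, differentially easy) item counts per form.
table8_visually_impaired = {"WSA3": (14, 3), "WSA5": (15, 4), "CSA5": (11, 8)}
table11_black = {"CSA5": (2, 4), "CSA7": (2, 7), "GSA2": (3, 3)}

def difficult_to_easy(table):
    """Return total difficult items, total easy items, and their ratio."""
    difficult = sum(d for d, _ in table.values())
    easy = sum(e for _, e in table.values())
    return difficult, easy, difficult / easy

print(difficult_to_easy(table8_visually_impaired))  # 40 difficult, 15 easy, ratio about 2.67
print(difficult_to_easy(table11_black))             # 7 difficult, 14 easy, ratio 0.5
```

The ratio of about 2.67 corresponds to "almost three times" for the visually impaired group, and 0.5 to "half the number" for Black students.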



Insert Table 11 about here 



Correlations between the mean M-H values and the 
standardized residuals for all analyzed clusters were .86, 
.86, and .84 for CSA5, CSA7, and GSA2, respectively, somewhat 
lower than for the visually impaired groups. 

Discussion 

The purpose of this study was to develop, try out, and 
evaluate a theory-based method of detecting the underlying 
causes of differential difficulty. Two population subgroups 
taking SAT-M (visually impaired students administered braille 
test editions, and Black students) were chosen, and the 
literature on their mathematical test performance was searched 
so that hypotheses about the causes of differential 
performance could be posed. 

For the visually impaired group, the methodology was 
moderately successful. The data supported the hypothesis of 
differential difficulty in several SAT-M cluster categories. 
Statistical analysis of individual items bolstered these 
cluster-level results, producing a reasonable set of 
characteristics that might make for differential difficulty: 
items in which figures were presented as part of the stimulus, 
which had small-to-medium sized diagrams, or in which 
estimation was helpful in eliminating options. Item-level 
analysis of all questions on all forms confirmed the existence 
of substantial differential difficulty for these students. At 
the cluster level, the standardized residuals were very highly 
correlated with the cluster mean M-H values, suggesting that 
the two indices were measuring very similar phenomena. The 
differences between the two might be owed to the speededness 
correction applied to one of the indices and/or to the fact 
that with such small samples, the mean M-H value might be 
somewhat less reliable (since it is an average of individual 
item values). 

For Black examinees, the method was effective in that it 
generally agreed with the results of the item-level analysis: 
neither method detected any consistent evidence of 
differential difficulty. In addition, the standardized 
residuals and cluster mean M-H values were highly correlated, 
though not as highly as in the visually impaired group. 

While the cluster-based method showed moderate success, 
considerable limitations were apparent. For instance, in both 
populations the majority of hypotheses went unsupported (none 
were supported for the Black examinee samples), while several 
baseline categories were supported. 

Why did the method fail to find differential functioning 
where it allegedly should and find it where it allegedly 
shouldn't? For both groups, the research base on the 
cognitive processes associated with mathematics skill 
development was found to be sparse and the results offered 
largely unreplicated, making for weak theoretical 
propositions. While such a limited base provides an 
acceptable foundation for exploratory analyses, it greatly 
restricts the power of predictions and the meaning that can be 
ascribed to results. 

Besides weak theory, a second possible confounding factor 
is the standardized residual criterion established for 
identifying differential cluster functioning. Among the Black 
samples, two nonhypothesized categories were significant 
because of the presence of one or two differentially difficult 
items per cluster. In these cases, the categories clearly did 
not represent an aberrant item type. To reduce the influence 
of single items on cluster functioning, the criterion might be 
made less liberal. Simulation research might help in 
identifying the criterion scores that have the greatest 
likelihood of identifying categories with pervasive 
differential functioning while producing an acceptable number 
of false positives and negatives. 
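
A simulation of the kind suggested here might start from something like the following. This is a toy sketch only: it assumes that, with no true differential functioning, cluster residuals are independent draws from a normal distribution with mean 0 and standard deviation 0.25, and both the distribution and the value 0.25 are illustrative assumptions rather than figures from the study. It estimates how often a category of three clusters would be flagged by the majority rule under a given criterion.

```python
import random

def majority_flagged(residuals, cutoff):
    """True when more than half of the residuals are at or below the cutoff."""
    return sum(r <= cutoff for r in residuals) * 2 > len(residuals)

def false_positive_rate(cutoff, n_clusters=3, sd=0.25, trials=20000, seed=0):
    """Share of null categories flagged as differentially difficult.

    sd and the normal model are assumed for illustration, not estimated from data.
    """
    rng = random.Random(seed)
    flagged = 0
    for _ in range(trials):
        residuals = [rng.gauss(0.0, sd) for _ in range(n_clusters)]
        flagged += majority_flagged(residuals, cutoff)
    return flagged / trials

for cutoff in (-0.2, -0.3, -0.4):
    print(cutoff, false_positive_rate(cutoff))
```

Under these assumptions a stricter criterion flags fewer null categories; a fuller study would also simulate categories with true differential functioning to estimate false negatives.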

An additional shortcoming evident in the results from the 
visually impaired samples is that even when hypotheses are 
supported they may not be the actual causes of differential 
functioning; other dimensions common to the items in a 
category might have caused the observed effect. For example, 
in the Embedded figures cluster structure both the hypothesized 
category (items with embedded figures) and its negation (items 
with nonembedded figures) showed strong, consistent evidence 
of differential difficulty. The same situation held for the 
Label A-J structure. Clearly, the hypothesized dimensions are 
not at the root of the differential difficulty observed here. 
In both cases, a plausible explanation is that the items 
composing the structures were subsets of the Stimulus format: 
figures structure, which showed pervasive differential 
difficulty at the item level. 

As suggested, the problem of overlapping structures is a 
considerable one. When such structures are posed, the same 
small core of differentially functioning items can cause 
categories in several structures to function aberrantly, 
making unclear what dimension is actually causing the observed 
effect. This limitation is partially a function of a weak 
theory of differential functioning for the subgroup, but also 
a result of the quasi-experimental nature of the method: 
without experimental control of item characteristics, strong 
tests of hypotheses cannot be conducted. 

Finally, the method was able to locate in the visually 
handicapped group only about half the items found to be 
differentially difficult by the item-level method and, in the 
Black group, none of the seven items. For the Black student 
group, the number of significant items detected by the item- 
level method is at about the chance level. For the visually 
impaired group, the number is larger, again pointing to the 
limited utility of the research base and the resulting 
hypotheses in fully accounting for differential difficulty.* 

How might the cluster-based method be applied most 
productively? As noted, one major improvement would be to 
target analyses at a small number of carefully selected 
hypotheses derived from a reasonably strong research base. In 
the absence of such a base, analyses can be no more than 
exploratory and limited interpretability should be expected to 
result. Second, to the extent possible, overlapping cluster 
structures should be avoided. When overlapping structures are 
indicated, the structure that is the more theoretically 
supportable should be selected. Third, results might be 
supplemented with more powerful methodologies. In particular, 
experimental designs are needed to allow more definitive tests 
of the hypotheses generated through quasi-experimental 
cluster-based designs. Such studies allow a degree of control 
of item content that cannot be achieved when working with 
intact tests, the type of manipulation needed to make stronger 
inferences about the causes of differential difficulty. Also 
of value might be protocol analyses in which a small number of 
subgroup members are asked to solve aloud problems from 
suspect item classes. Results can help to confirm hypotheses 
supported through quasi-experimental research or, 
alternatively, help build the database needed to specify 
hypotheses more effectively. 

In sum, under the proper conditions, the cluster-based 
method would seem a potential incremental improvement over 
post-hoc, item-level approaches to differential functioning. 
The approach would appear better if for no other reason than 
that it is more conservative: the focus is on determining if 
the data endorse one's propositions rather than on 
constructing explanations to support the data. It is equally 
clear, however, that the method has important limitations that 
can be avoided best by applications in which a relatively 
strong set of predictions can be derived from a sound research 
base. 

Table 1 

Background Data for Visually Impaired (VI) 
and Nonhandicapped (NH) Examinees 

                      % of    Mean   SAT-M      SAT-V      Mean Math   Mean # Years
Form   Group     N    Males   Age    Mean(SD)   Mean(SD)   Grade       of Math

WSA3   VI        91   47%     18.2   381(112)   421(124)   2.5         3.1
       NH      1110   46%     17.1   498(118)   451(111)   ---         ---
WSA5   VI        96   55%     17.9   442(140)   444(136)   2.9         3.0
       NH      1398   47%     16.7   486(114)   444(113)   ---         ---
CSA5   VI        74   50%     18.3   402(127)   422(127)   2.9         3.7
       NH      5507   44%     17.3   471(112)   436(103)   3.0         3.6


Note. Math grades and number of years of math are self-reported data 
taken from the Student Descriptive Questionnaire. Math grade is the 
grade received in the most recently taken math course. Number of 
years of math is the number the student expects to complete by the end 
of high school. For math grades, the Ns for visually handicapped 
students were 37, 53, and 26 for WSA3, WSA5, and CSA5, respectively. 
For number of years of math, the Ns were 39, 4S, and 28. Data on 
these variables were not available for nonhandicapped examinees taking 
WSA3 or WSA5. 

Table 2 

Background Data for Black and White Examinees 

                      % of    Mean   SAT-M      SAT-V      Mean Math   Mean # Years
Form   Group     N    Males   Age    Mean(SD)   Mean(SD)   Grade       of Math

CSA5   Black    446   34%     17.3   363(94)    345(92)    2.7         3.6
       White   4405   44%     17.3   482(106)   447(98)    3.0         3.6
CSA7   Black    834   36%     17.4   359(89)    332(89)    2.6         3.3
       White   4798   46%     17.4   466(107)   431(98)    2.8         3.5
GSA2   Black    705   36%     17.5   366(86)    335(88)    2.5         3.6
       White   3985   48%     17.6   465(112)   422(98)    2.8         4.0



Note. For the CSA forms, math grade is the grade received in the most 
recently taken math course; for GSA2, it is the average of grades 
received in all math courses. Math grades and number of years of math 
are self-reported data taken from the Student Descriptive 
Questionnaire. Number of years of math is the number the student 
expects to complete by the end of high school. 

Table 3 

Item Cluster Structures and Cluster Categories Hypothesized 
to be Differentially Difficult 

Cluster Structure & Category 

Geometry 
   triangles 
   polygons 
   3-D solids 
   other geometry 
Miscellaneous 
   number properties 
   newly defined operations 
   other miscellaneous 
Spatial factor 
   no figure, but drawing helpful 
   possible spatial component 
   estimation helpful in eliminating options 
   ordinary geometry 
Reading difficulty 
   difficult 
   medium 
   easy 
Concrete/abstract 
   concrete 
   abstract 
Stimulus format: Picture 
   figures 
   graphs and tables 
Reading load 
   high 
   medium 
   low 
Key 
   key "c" 
   not key "c" 
Graphic placement 
   separated from item 
   not separated 
Shading 
   shaded 
   not shaded 
Diagram size 
   small 
   medium 
   large 
Embedded Figures 
   embedded 
   not embedded 
Label A-J 
   A-J 



Visually Impaired Students     Black Students 

X X 

X 

X X 

X 

X 

X X 

X 

X 

X 

X 

X 
X 

X X 



X 

X 

X 

X 

Table 4 

Standardized Mean Differences Between Groups on 
Mathematical Raw and Scale Scores 

                             Standardized Difference
                          Rights-    Formula    Scale
Contrast                  Only       Score      Score

Visually Impaired/
Nonhandicapped
   WSA3                    .97        .97       1.00
   WSA5                    .44        .39        .39
   CSA5                    .65        .66        .62
Black/White
   CSA5                   1.10       1.12       1.12
   CSA7                    .99        .99        .99
   GSA2                    .88        .91        .89

Note. Standardized differences were computed using the standard 
deviation of the appropriate reference group. 

Table 5 

Mean Number of Items Omitted and Not Reached for Study Groups 

Visually Impaired Group 

          Mean # Not Reached            Mean # Omitted
          Visually      Nonhandi-       Visually        Nonhandi-
Form      Impaired      capped          Impaired        capped

WSA3      1.7a          2.8             8.1c            4.4
WSA5      [illegible]   1.5             [illegible]d    4.8
CSA5      1.2b          2.6             7.1e            4.6

a p < .05, z (one-tailed) = -2.05 
b p < .01, z (one-tailed) = -3.21 
c p < .001, z (two-tailed) = 6.19 
d p < .001, z (two-tailed) = 5.59 
e p < .001, z (two-tailed) = 4.25 

Black Group 

          Mean # Not Reached            Mean # Omitted
Form      Black         White           Black           White

CSA5      3.2a          2.5             [illegible]     4.6
CSA7      2.5b          1.9             5.3d            4.4
GSA2      2.8c          2.1             6.0             5.6

a p < .001, z (one-tailed) = 4.05 
b p < .001, z (one-tailed) = 4.87 
c p < .001, z (one-tailed) = 5.08 
d p < .001, z (two-tailed) = 4.09 

Table 6 

Standardized Residuals for Visually Impaired Students 
taking the Braille Edition of SAT-M 

Cluster Category               WSA3          WSA5          CSA5

Geometry
   triangles                   -.44          -.35          [illegible]
   polygons                     .27           .03          -.76
   3-D solids                   ---           .02           .14
   other geometry              -.25          -.29           12
Miscellaneous
   number properties           -.41           .07           .15
   newly defined operations    -.42           NSI           NSI
   other miscellaneous         -.77          -.28          -.49
Spatial factor
   no figure, but drawing
     helpful                    NSI           .04           ---
   possible spatial
     component                 -.35           05, -.26     -.42
   estimation helpful in
     eliminating options        ---          -.37          -.66
   ordinary geometry           -.47, -.02    -.16          -.12, .41
Stimulus format: Picture
   figures                     -.43, -.49    -.41, -.33    -.26, -.47
   graphs and tables            NSI          -.41          -.25
Reading load
   high                        -.34, .41     -.08, .18      .15, -.08
   medium                      -.13, -.49    -.07, -.16     .14, .08
   low                         -.01, .20     -.18, .07      .18, -.07
Graphic placement
   separated from item         -.06          -.47          -.07, -.73, .09
   not separated               -.71, -.13    -.35, -.26
Shading
   shaded                       .17          -.42
   not shaded                  -.49, -.51    -.26           ---
Diagram size
   small                       -.41          -.26          -.20
   medium                       ---          -.48          -.52
   large                        .10          -.36          -.07
Embedded Figures
   embedded                    -.16          -.58          -.41
   not embedded                -.64          -.27          -.27, -.32
Label A-J
   A-J                         -.16          -.55          -.42
   not A-J                     -.53          -.34          -.31

Note. NSI = a single, nonsignificant item. 

Table 7 

Proportion of Differentially Difficult Items in Each Supported 
Hypothesized Cluster Category for Visually Impaired Students 

Cluster Category               WSA3          WSA5          CSA5

Geometry
   triangles                   3/5           2/4           1/5
Spatial factor
   possible spatial
     component                 3/6           1/4, 3/6      2/6
   estimation helpful in
     eliminating options       ---           2/4           4/6
Stimulus format: Picture
   figures                     3/6, 5/8      2/5, 4/7      3/5, 3/7
   graphs and tables           0/1           2/3           1/6
Diagram size
   small                       3/5           2/3           2/6
   medium                      ---           4/6           3/7

Note. An item was considered to be differentially difficult if its 
M-H index equalled or exceeded 1.0 and differed from zero at the .05 
level of significance. 

Table 8 

Numbers of Differentially Difficult vs. Differentially Easy Items for 
Visually Impaired Students when all Items Across all Forms are 
Assessed 

            Differentially     Differentially
Form        Difficult          Easy

WSA3        14                  3
WSA5        15                  4
CSA5        11                  8
TOTAL       40                 15

Note. An item was considered to be differentially difficult if its 
M-H index exceeded 1.0 and differed from zero at the .05 level 
of significance. Differential easiness was defined as an M-H index 
equal to or less than -1.0 and differing from zero at the .05 level 
of significance. 

Table 9 

Mean M-H Values of Supported Hypothesized Cluster Categories for 
Visually Impaired Students 

Cluster Category               WSA3           WSA5           CSA5

Geometry
   triangles                   -1.03          -1.24            .78
Spatial factor
   possible spatial
     component                 -1.11          -.02, -1.11    -1.13
   estimation helpful in
     eliminating options                      -1.28          -1.87
Stimulus format: Picture
   figures                     -.91, -1.67    -1.36, -1.14   -.89, -1.12
   graphs and tables           -.32           -2.02          -.52
Diagram size
   small                       -1.65          -1.14          -.54
   medium                                     -1.60          -1.29

Table 10 

Standardized Residuals for Black Students taking SAT-M 

Cluster Category               CSA5          CSA7          GSA2

Geometry
   triangles                   -.02          -.16          -.04
   polygons                    -.34          -.01           .24
   3-D solids                   .23          -.29           NSI
   other geometry              -.08           .03          -.20
Spatial factor
   no figure, but drawing
     helpful                    ---          -.06          -.09, .09
   possible spatial
     component                 -.47          -.44          -.32
   estimation helpful in
     eliminating options       -.33          -.03           NSI
   ordinary geometry            .06           .09           .15
Reading difficulty
   difficult                   -.16          -.14           .13
   medium                      -.18          -.26           .03
   easy                        -.28           .13          -.20
Concrete/abstract
   concrete                    -.32, -.27    -.25, -.12    -.20, -.15
   abstract                    -.25, -.18    -.23, -.26    -.21, -.29
                               -.25, .03      .23, -.26    -.10, .00
                                .27, .44      .32, .07      .20, .24
Reading load
   high                        -.07, -.14    -.07, -.10     .01, -.22
   medium                      -.08, -.03    -.14, .15     -.06, .05
   low                         -.07, .03     -.03, -.25    -.21, .06
Key (Regular math)
   key "c"                     -.18, .26     -.16, -.17     .10, -.08
   not key "c"                  .02, -.21    -.29, -.10    -.10, -.06

Table 11 

Numbers of Differentially Difficult vs. Differentially Easy Items for 
Black Students when all Items Across all Forms are Assessed 

            Differentially     Differentially
Form        Difficult          Easy

CSA5         2                  4
CSA7         2                  7
GSA2         3                  3
TOTAL        7                 14

Note. An item was considered to be differentially difficult if its 
M-H index exceeded 1.0 and differed from zero at the .05 level 
of significance. Differential easiness was defined as an M-H index 
equal to or less than -1.0 and differing from zero at the .05 level 
of significance. 

Appendix: Cluster Category Definitions 

Geometry 

Triangles: item involves one or more triangles. 

Polygons: item involves one or more polygons, other than 
triangles. 

3-dimensional solids: item involves one or more 3-dimensional 
solids. 

Other Geometry: item involves points, rays, lines, or angles 
in a plane; circles; or coordinate geometry (i.e., number line 
or rectangular coordinate system). 

Miscellaneous 



Number Properties: item concerns the structure of the number 
system or elementary number system concepts. 

Newly defined operations: item contains special symbols or 
made up definitions. 

Other miscellaneous: item concerns new concepts, probability, 
geometric perception, or sets. 

Spatial Factor 

No figure, but drawing helpful: item does not have a figure 
associated with it but making a sketch or drawing would help 
in solving it. 

Possible spatial factor: item may require spatial skills to 
generate a solution. 

Estimation helpful: spatial estimation appears helpful in 
eliminating at least two of the options. 

Ordinary geometry: item can be solved by reference to factual 
relationships, rather than by spatial intuition. 

Reading Difficulty (stem only) 

Difficult: items containing compound sentences and/or large 
numbers of words perhaps requiring logic to sort out the 
meaning. Items which require careful reading. 

Example: Worker W produces n units in 5 hours. Workers V 
and W, working independently but at the same time, 
produce n units in 2 hours. How long would it take V 
alone to produce n units? 

Medium: items with less verbiage; contain a simple word, 
phrase, or short sentences. Meaning is readily clear. 

Example; A certain photocopying machine can make 10 
copies every 4 seconds. At this rate, how many copies 
can the machine make in 6 minutes? 



Easy: items which do not contain words or items which contain 
only a few (at most) standard words, such as (a) if, then; 
(b) and; (c) if, and, then; and (d) in the figure above. 

Example: 

(a) If y/x = -1, then y + x = 

(b) x = 9 and y = 3 

(c) If 2x + 3y = 15, and y = 1, then 2x = 

(d) In the figure above, x = 
    (without any further explanation given, 
    other than the figure. If a more detailed 
    explanation is given in the stem, the item 
    would be considered to fall in the medium 
    category.) 



Concrete/Abstract 



Concrete: questions which are real-life word problems. 

Example: A supervisor was paid for her travel expenses at 
the rate of $0.20 per mile. If she received $14.40, for 
how many miles was she paid? 

Abstract: questions that do not involve real-life settings. 

Example: What is the sum of the areas of two squares with 
sides of lengths 1 and 3, respectively? 

Stimulus Format: Pictures 

Figures: item contains a figure or picture that does not have 
a coordinate system (has a triangle, square, rectangle, line 
segment, etc.). 

Graphs and Tables: item has a coordinate system or is a line, 
bar or circle graph; is a number line; or item has data 
presented in rows and columns. The latter includes magic 
squares and times tables. 

Reading Load 

High: the number of words in the item is in the highest 
quartile for all items in the test section. 

Medium: the number of words in the item is in the middle 50% 
of all items in the test section. 

Low: the number of words in the item is in the lowest quartile 
for all items in the test section. 

Key 

Key "c": item is a five choice item with the correct answer 
corresponding to "c." 

Not key "c": item is a five choice item with the correct 
answer corresponding to "a," "b," "d," or "e." 

Graphic Placement 

Separated from item: the item contains a figure that is placed 
on a page separate from the text of the item. 

Not Separated: the item contains a figure that appears on the 
same page as the item proper. 

Shading 

Shaded: the item contains a shaded figure. 

Not Shaded: the item contains a figure that is not shaded. 

Figure Size 

Small: the figure is among the smallest third in area across 
all figures across forms (less than 15 square inches) 

Medium: the figure is among the middle third in area across 
all figures across forms (from 15 to 26 square inches) 

Large: the figure is among the largest third in area across 
all figures across forms (more than 26 square inches) 

Embedded Figures 

Embedded: the item contains a geometric figure that has 
another geometric figure embedded in it to which the item 
refers. 

Not Embedded: the item contains a figure that does not have 
another geometric figure within it. 